Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes
Authors
Abstract
CpG islands are GC-rich regions often located at the 5' end of genes and normally protected from cytosine methylation in mammals. The important role of CpG islands in gene transcription strongly suggests that they are evolutionarily conserved in mammalian genomes. However, because CpG dinucleotides are over-represented in CpG islands, comparative CpG island analysis using conventional sequence analysis techniques remains a major challenge in the epigenetics field. In this study, we conducted a comparative analysis of all CpG island sequences in 10 mammalian genomes. Because sequence similarity methods and character-composition techniques such as information-theoretic measures are particularly difficult to apply to these sequences, we used exact patterns in CpG island sequences and single-character discrepancies to identify differences between them. First, by calculating genome distances based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of the 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a model trained on human on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.
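As a rough illustration of the rank-correlation idea, the sketch below counts k-mers in two sets of sequences and turns the Spearman rank correlation of their k-mer count profiles into a distance (1 minus the correlation). The abstract does not say which rank correlation test was used, how the distance was derived from it, or how k-flank patterns are defined, so the choice of Spearman, the `1 - rho` distance, and all function names here are assumptions made for illustration; k-flank patterns are not modelled.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:  # a constant profile has no defined correlation
        return 0.0
    return cov / (vx * vy) ** 0.5


def kmer_distance(seqs_a, seqs_b, k=2):
    """Assumed distance: 1 - Spearman correlation of pooled k-mer counts."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts_a, counts_b = Counter(), Counter()
    for s in seqs_a:
        counts_a.update(kmer_counts(s, k))
    for s in seqs_b:
        counts_b.update(kmer_counts(s, k))
    profile_a = [counts_a[m] for m in all_kmers]
    profile_b = [counts_b[m] for m in all_kmers]
    return 1.0 - spearman(profile_a, profile_b)
```

Under these assumptions, calling `kmer_distance` on the CpG island sequences of every pair of genomes would give a distance matrix that a standard tree-building method could then take as input; that downstream step is not shown here.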
Similar resources
Predicting CpG Islands and Their Relationship with Genomic Feature in Cattle by Hidden Markov Model Algorithm
Cattle supply an important source of nutrition for humans in the world. CpG islands (CGIs) are very important and useful, as they carry functionally relevant epigenetic loci for whole genome studies. As a matter of fact, there have been no formal analyses of CGIs at the DNA sequence level in cattle genomes and therefore this study was carried out to fill the gap. We used hidden Markov model alg...
Comparative Analysis of DNA Word Abundances in Four Yeast Genomes Using a Novel Statistical Background Model
Previous studies have shown that the identification and analysis of both abundant and rare k-mers or "DNA words of length k" in genomic sequences using suitable statistical background models can reveal biologically significant sequence elements. Other studies have investigated the uni/multimodal distribution of k-mer abundances or "k-mer spectra" in different DNA sequences. However, the existin...
Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages.
Molecular evolution studies are usually based on the analysis of individual genes and thus reflect only small-range variations in genomic sequences. A complementary approach is to study the evolutionary history of rearrangements in entire genomes based on the analysis of gene orders. The progress in whole genome sequencing provides an unprecedented level of detailed sequence data to infer genom...
The Evolution of Mammalian Genomic Imprinting Was Accompanied by the Acquisition of Novel CpG Islands
Parent-of-origin-dependent expression of imprinted genes is mostly associated with allele-specific DNA methylation of the CpG islands (CGIs) called germ line differentially methylated regions (gDMRs). Although the essential role of gDMRs for genomic imprinting has been well established, little is known about how they evolved. In several imprinted loci, the CGIs forming gDMRs may have emerged wi...
Mapping the zebrafish brain methylome using reduced representation bisulfite sequencing
Reduced representation bisulfite sequencing (RRBS) has been used to profile DNA methylation patterns in mammalian genomes such as human, mouse and rat. The methylome of the zebrafish, an important animal model, has not yet been characterized at base-pair resolution using RRBS. Therefore, we evaluated the technique of RRBS in this model organism by generating four single-nucleotide resolution DN...